Abstract. The saturation-based reasoning methods are among the most theoretically developed ones and are used
by most of the state-of-the-art first-order logic reasoners. In the last decade there was a sharp increase in performance 
of such systems, which I attribute to the use of advanced calculi and the intensified research in implementation 
techniques. However, nowadays we are witnessing a slowdown in performance progress, which may be considered 
as a sign that the saturation-based technology is reaching its inherent limits. The position I am trying to put forward 
in this paper is that such scepticism is premature and a sharp improvement in performance may potentially be 
reached by adopting new architectural principles for saturation. The top-level algorithms and corresponding designs 
used in the state-of-the-art saturation-based theorem provers have (at least) two inherent drawbacks: the insufficient 
flexibility of the used inference selection mechanisms and the lack of means for intelligent prioritising of search 
directions. In this position paper I analyse these drawbacks and present two ideas on how they could be overcome. 
In particular, I propose a flexible low-cost high-precision mechanism for inference selection, intended to overcome
problems associated with the currently used instances of clause selection-based procedures. I also outline a method 
for intelligent prioritising of search directions, based on probing the search space by exploring generalised search 
directions. I discuss some technical issues related to implementation of the proposed architectural principles and 
outline possible solutions. 



1 Introduction 

An automatic theorem prover for first-order logic (FOL) is a software system that can be used to show that 
some conjectures formulated in the language of FOL are implied by some theory. The expressiveness of 
FOL and its relative mechanisability make automated theorem proving in FOL a useful instrument for such 
applications as verification M5I4I1I6H and synthesis [19] of hardware and software, knowledge
representation [18], Semantic Web [16], assisting human mathematicians 112 1131, background reasoning in interactive
theorem provers [23], and others.

This paper is concerned with the theorem proving method based on the concept of saturation. Given an 
input set of formulas, the prover tries to saturate it under all inferences in the inference system of the prover. 
In order to deal with syntactic objects which allow efficient calculi, the input set of formulas is usually 
converted into a set of formulas of a special form, called clauses^1. Demonstrating validity of a first-order
formula is thereby reduced to demonstrating unsatisfiability of the corresponding set of clauses^2. The calculi
working with clauses are usually designed in such a way that inferences can only produce clauses (see, e.g.,
[2, 26]).

There are three possible outcomes of the saturation process on clauses: (1) an empty clause is derived, 
which means that the input set of clauses is unsatisfiable; (2) saturation terminates without producing an 
empty clause, in which case the input set of clauses is satisfiable (provided that a complete inference system 
is used); (3) the prover runs out of resources. The saturation method is well-studied theoretically ([2, 26])
and is implemented in a significant number of modern provers, e.g., E P4l, E-SETHEO (the E component),
Gandalf EH, Otter [22], SNARK [36], Spass El, Vampire [32, 30], and Waldmeister iflill.

1 Universally quantified disjunctions of literals. A literal is either an atomic formula (possibly depending on some variables), or a
negation of such an atomic formula.

2 Sometimes problems coming from applications are already represented in the clausal form or require only minor transformation.

In the last decade there has been a sharp increase in performance of such systems^3, which I attribute
to the use of advanced calculi and inference systems (primarily, complete variants of resolution [2] and
paramodulation [26] with ordering restrictions, and a number of compatible redundancy detection and
simplification techniques), and intensified research on efficient implementation techniques, such as term
indexing (see [12] and the more recent survey 051 ), heuristic methods for guiding proof search (see, e.g., EH)
and top-level saturation algorithms (see, e.g., [13] and Ell). Unfortunately, the initial momentum created
by such work seems to have diminished, and nowadays we are witnessing a slowdown in performance
progress^4. Some researchers consider this to be a sign that the saturation-based reasoning technology is
reaching its inherent limits. The position I am trying to defend in this paper is that such scepticism is 
premature. My argumentation is based on a thesis that potential opportunities for a new breakthrough in 
performance have not been exhausted. Namely, the possibility of adopting new implementation frameworks 
for saturation, i.e., top-level designs and algorithms, has not been fully explored. To support this claim, 
I will pinpoint some major weaknesses in the organisation of proof search in the standard approaches to 
implementing saturation, and propose two concrete ideas on how to overcome these problems. 

First, I will analyse some inherent problems with the standard procedures for saturation, based on the 
implementation of inference selection via clause selection. In particular, I consider the two main procedures, 
the OTTER algorithm and the DISCOUNT algorithm, based on clause selection. The main problem with 
the former procedure is the coarseness of inference selection, which translates into insufficient productiv- 
ity of heuristics and restricts the choice of possible heuristics. The latter procedure implements very fine 
selection of inferences, but at a high cost in terms of computational resources. I will propose a new proce- 
dure based on a flexible high-precision inference selection mechanism with acceptable overhead. A concrete 
implementation scheme will be outlined. 

Second, I will highlight the inadequacy of the popular approaches to prioritising proof search directions, 
based on syntactic characteristics of separate clauses. As a possible remedy, I propose a method for intel- 
ligent prioritising of search directions, based on probing the search space by exploring generalised search 
directions. I also propose a concrete implementation scheme for the method. 

This criticism of the current state of affairs in the saturation architectures originates in my hands-on 
experience with implementing the saturation-based kernel of Vampire [32, 30], and numerous experiments
with the system. In fact, I consider the observations related to proof search effectiveness, on which this paper 
is based, the most valuable lessons learned from the Vampire kernel implementation. However, this paper is 
only a position paper. As such, it does not present any complete results, either theoretical or experimental. 
Its aim is to provide a basis and an inspiration for new implementations and experiments. 

The rest of this paper is structured as follows. Each of the remaining two sections introduces a new 
architectural principle. At the beginning of each section, the relevant aspects of the state-of-the-art designs are
criticised. Then, ideas for a possible remedy are formulated, followed by a discussion of related work and a
tentative research programme. 

Concluding this introduction, I would like to ask the reader to be tolerant of some presentational problems
with this text. I am trying to keep this paper informative for experts in the implementation of saturation-based
provers and, at the same time, acceptable for a superficial reading by a broader audience. Some negative
consequences of such a conflict of intentions seem to be inevitable.

3 A good benchmark is Otter, which has not changed much since 1996. Compare its relative performance in CASC-13
(http://www.cs.miami.edu/~tptp/CASC/13/) and CASC-20 (http://www.cs.miami.edu/~tptp/CASC/20/).

4 Compare the performance of the best provers in CASC-20 (http://www.cs.miami.edu/~tptp/CASC/20/) with the previous year's
winners.

2 General preliminaries 

For the sake of self-containedness, I will reproduce a number of standard definitions here. 

I am assuming that the reader is familiar with the syntax and semantics of first-order predicate logic with 
equality. In what follows, ordinary predicate symbols will be denoted by p, q and r, the equality predicate 
will be denoted by ~, function symbols will be denoted by f, g and h, individual constants will be denoted
by a, b and c, variables will be denoted by x, y and z, possibly with subscripts, and the letters s and t, 
possibly with subscripts, will denote terms. 

We are mostly interested in a special kind of first-order formulas called clauses. A clause is a disjunction 
$L_1 \lor \dots \lor L_n$, where all $L_i$ are literals, i.e., atoms (positive literals) or negated atoms (negative literals). The
order of the literals in a clause is usually irrelevant, so I will often refer to clauses as finite multisets of 
literals. The empty multiset of literals will also be considered a clause which is false in any interpretation. 

Substitutions are total functions that map variables to terms. They will be denoted by $\theta$ and $\sigma$, possibly
with subscripts. Substitution application is extended to complex expressions, such as terms, atoms, literals
and clauses, in an obvious way: if $E$ is an expression, $E\theta$ is obtained by replacing each variable $x$ in $E$ by
$x\theta$. A substitution $\theta$ is a unifier for two expressions $E_1$ and $E_2$ if $E_1\theta = E_2\theta$. It is the most general unifier
if, for any other unifier $\theta_1$, there exists a substitution $\theta_2$ such that $E_1\theta_1 = (E_1\theta)\theta_2$.
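
The definitions of substitution application and most general unifier can be made concrete with a small sketch. It is only an illustration under stated assumptions: the term encoding (`Var` objects for variables, tuples `(symbol, arg1, ..., argk)` for everything else) and the function names `apply`, `occurs` and `mgu` are hypothetical choices made for this example, not part of any prover discussed in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A variable; any other term is a tuple (symbol, arg1, ..., argk)."""
    name: str


def apply(term, theta):
    """Apply substitution theta (a dict Var -> term) to a term."""
    if isinstance(term, Var):
        # Bindings may mention other bound variables, so resolve recursively.
        return apply(theta[term], theta) if term in theta else term
    return (term[0],) + tuple(apply(arg, theta) for arg in term[1:])


def occurs(var, term, theta):
    """Occurs check: does var appear in term after applying theta?"""
    term = apply(term, theta)
    if isinstance(term, Var):
        return term == var
    return any(occurs(var, arg, theta) for arg in term[1:])


def mgu(s, t):
    """Return a most general unifier of s and t as a dict, or None if none exists."""
    theta = {}
    stack = [(s, t)]
    while stack:
        a, b = stack.pop()
        a, b = apply(a, theta), apply(b, theta)
        if a == b:
            continue
        if isinstance(a, Var):
            if occurs(a, b, theta):
                return None
            theta[a] = b
        elif isinstance(b, Var):
            if occurs(b, a, theta):
                return None
            theta[b] = a
        elif a[0] == b[0] and len(a) == len(b):
            # Same symbol and arity: unify the arguments pairwise.
            stack.extend(zip(a[1:], b[1:]))
        else:
            return None
    return theta


x, y = Var("x"), Var("y")
a, b = ("a",), ("b",)
theta = mgu(("f", x, y), ("f", a, b))   # binds x to a and y to b
```

Constants are encoded as one-element tuples, so `("a",)` stands for the constant a. The unifier is returned in triangular form, which is why `apply` follows bindings recursively.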

We will say that a clause $C$ subsumes a clause $D$ if there is a substitution $\theta$ such that (the multiset of
literals) $C\theta$ is a submultiset of $D$.

We are interested in implementation of calculi based on resolution and paramodulation (see, e.g.,
[2, 26]). (Unrestricted) binary resolution is the following deduction rule:

$$\frac{C \lor A \qquad D \lor \neg B}{(C \lor D)\theta}$$

where $\theta$ is the most general unifier of the atoms $A$ and $B$.
Paramodulation is the following rule:

$$\frac{C \lor s \sim t \qquad D[u]}{(C \lor D[t])\theta}$$

where $\theta$ is the most general unifier of the terms $s$ and $u$, and $u$ is not a variable.
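
As a worked illustration of the two rules (the clauses are made up for this example and are not taken from any cited source), let $C \lor A = q(x) \lor p(x, b)$ and $D \lor \neg B = r(y) \lor \neg p(a, y)$. The atoms $p(x, b)$ and $p(a, y)$ have the most general unifier $\theta = \{x \mapsto a,\ y \mapsto b\}$, so binary resolution gives

$$\frac{q(x) \lor p(x, b) \qquad r(y) \lor \neg p(a, y)}{q(a) \lor r(b)}$$

For paramodulation, take the unit equality $f(x, b) \sim x$ (so $C$ is empty, $s = f(x, b)$ and $t = x$) and $D[u] = p(f(a, y))$ with $u = f(a, y)$. The same $\theta$ unifies $s$ and $u$, and $D[t]\theta = p(x)\theta$, so

$$\frac{f(x, b) \sim x \qquad p(f(a, y))}{p(a)}$$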

A resolution-based reasoner usually applies restricted variants of these rules together with some auxiliary 
rules to demonstrate unsatisfiability of an input clause set by deriving an empty clause from it. Such deriva- 
tions are called refutations of the corresponding clause sets. 

Saturation-based reasoners are called so because of the way they search for refutations. In an attempt to 
derive an empty clause, a reasoner tries to saturate the initial set with all clauses derivable from it. Roughly 
speaking, at some steps of the saturation process the reasoner selects a possible inference between some 
clauses in the current clause set, applies the inference and adds the resulting clause to the current clause set. 
Other steps of the process usually prune the search space by removing redundant clauses, i. e. clauses that 
are not strictly necessary to find a refutation. For details on the concept of saturation modulo redundancy, 
the reader is referred to [2].
3 Fine inference selection at affordable cost



3.1 Background: inference selection via clause selection 

When one has to search in an indefinitely large space, the ability to explore more promising search directions 
before the less promising ones is a key to success. In saturation-based reasoning the mechanism responsi- 
ble for deciding on which direction to promote first is known as inference selection. Ideally, the inference 
selection should be able to name one single inference to be deployed at every step of saturation, and the de- 
cision should be based on the (heuristically evaluated) quality of the resulting clause. In practice, most of the 
working saturation-based systems adopt a simpler but coarser mechanism known as clause selection. Instead 
of selecting a single inference at a time, we select a clause and commit to immediately deploying all possible
inferences between the clause and all active (previously selected) clauses. Clauses of better heuristically
evaluated quality are given higher priority for selection, in the hope that they will produce heuristically
good inferences. The algorithm realising inference selection via clause selection is known as the given-clause
algorithm. Its variants have been used in provers since as early as 1974 [27] (see also [20]), although its
current monopoly seems to be mostly due to the success of Otter [22]. Other provers based on variants of
the given-clause algorithm include E, Gandalf, SNARK, Spass, Vampire and Waldmeister, i.e., practically all
modern saturation-based systems. 

In order to illustrate the main idea behind the given-clause algorithm, namely the implementation of
inference selection via clause selection, it is sufficient to consider only deduction inferences. So, the algorithm
presented in Figure 1 performs no simplification steps.



procedure GivenClause(input : set of clauses)
  var new, passive, active : sets of clauses
  var current : clause
  active := ∅
  passive := input
  while passive ≠ ∅ do
    current := select(passive)
    passive := passive \ {current}
    active := active ∪ {current}
    new := infer(current, active)
    if new contains empty clause
      then return refutable
    passive := passive ∪ new
  od
  return failure to refute

Fig. 1. Given-clause algorithm (without simplifications)
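
The loop of Figure 1 can be tried out on ground (variable-free) clauses with the following sketch. Several things here are assumptions made for the example only: clauses are frozensets of literal strings with `~` marking negation, `select` is replaced by "pick the smallest clause", and `infer` is plain ground binary resolution. The sketch also drops newly derived clauses that are already active, a trivial duplicate check that Figure 1 does not contain, because without it even a ground run may re-derive the same clause forever.

```python
def negate(literal):
    """Complement of a ground literal written as 'p' or '~p'."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def resolvents(c, d):
    """All binary resolvents of two ground clauses (frozensets of literals)."""
    return {(c - {lit}) | (d - {negate(lit)}) for lit in c if negate(lit) in d}


def infer(current, active):
    """All resolvents between the current clause and the active clauses."""
    new = set()
    for other in active:
        new |= resolvents(current, other)
    return new


def given_clause(input_clauses):
    active = set()
    passive = {frozenset(c) for c in input_clauses}
    while passive:
        # Hypothetical selection heuristic: fewest literals first,
        # with ties broken alphabetically to keep runs reproducible.
        current = min(passive, key=lambda c: (len(c), sorted(c)))
        passive.remove(current)
        active.add(current)
        new = infer(current, active)
        if frozenset() in new:
            return "refutable"
        # Duplicate check added for this sketch; see the note above.
        passive |= new - active
    return "failure to refute"


result = given_clause([{"p", "q"}, {"~p", "q"}, {"~q"}])
```

Note how one selection step commits to every inference between `current` and the whole of `active`; this is the coarseness discussed in the remainder of this section.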



It is also convenient to represent the algorithm with a more abstract dataflow diagram as in Figure 2.
In this picture, the boxes denote operations performed on clauses. The rounded boxes denote sets
of clauses. The shallow ones correspond to the sets that typically contain very few clauses, while the
deep ones correspond to the sets that can grow large. The arrows reflect the information flow for different
operations. Arrows labeled with the same number belong to the same operation/processing phase. In
Figure 2, label 1 corresponds to the line passive := input in the pseudocode from Figure 1, phase 2
is clause selection (current := select(passive) and passive := passive \ {current}), 3 corresponds
to active := active ∪ {current}, 4 is the generation of deduction inferences between current and
active (new := infer(current, active)), and 5 is the integration of newly derived clauses into passive
(passive := passive ∪ new). The thin solid arrows show the movements of clauses between clause sets and
from operations to the sets. A dashed arrow from a set to an operation indicates that the operation depends
on the clauses from the set.

Fig. 2. Dataflow in the given-clause algorithm

My experience with Vampire and, to some extent, with other provers allows me to see a number of soft 
spots of the given-clause algorithm: 

- The selection is based on the properties of eligible clauses, which are only vaguely related to the prop- 
erties of the enabled inferences. 

A "good" clause may interact with many previously selected "not-so-good" clauses and produce many 
"not-so-good" inferences. The set of selected clauses often contains such heuristically bad clauses for a 
number of reasons. In particular, we cannot completely avoid selecting bad clauses because in general it 
leads to incompleteness. Moreover, in practice we often cannot even significantly restrict the selection 
of heuristically bad clauses since such a strategy easily leads to loss of solutions (in the practical sense,
i.e., solutions that can be obtained with given resources). Another reason why bad clauses get into the 
set of selected clauses is the relativity of the heuristic estimation of clause quality: a clause selected 
as relatively good in the beginning of the proof search, can become relatively bad later if many better 
clauses have been derived. 

Another problem with clause property-based selection is that even two "good" clauses can easily have 
"not-so-good" inferences between them. This happens when the clause quality criteria do not sufficiently 
penalise clauses containing "bad" parts available for inferences. If our quality criteria are too strict with 
respect to clauses with "bad" parts, the prover also postpones the inferences involving "good" parts of 
such clauses. 

- The newly selected clause may, and often does, interact with very many parts of very many active clauses. 
This often leads to pathological situations of the following kind: a prolific clause is selected and the 
processing of inferences between this clause and many active ones takes all available time, whereas a 
few inferences with other clauses would lead to a solution. 

In sum, the coarseness of the clause selection principle deprives us of control over the proof search pro- 
cess to a great extent, which translates into poor productivity of heuristics, restricts the choice of heuristics 
that can be implemented, and leads to littering the search state with too many "undesirable" clauses. 
There are two main variants of the given-clause algorithm: the Otter algorithm^5 and the DISCOUNT
algorithm^6, which differ in the way the passive (waiting to be selected) clauses are treated.




Fig. 3. Otter algorithm 



In the Otter algorithm, presented as a dataflow diagram in Figure 3, the passive clauses are subject to
simplification by the newly derived clauses, can be discarded as redundant with the help of the newly derived
clauses, and themselves can be used to simplify/discard the newly derived clauses. Newly derived clauses
are subject to forward simplification, which may transform them or even discard them completely^7. Note
that in the Otter algorithm forward simplification uses both passive and active clauses as simplifiers (see the 
dashed arrows labeled with 5 in the diagram). Backward simplification also affects passive clauses as well 
as the active ones (see the broad arrows labeled with 7). 

In the DISCOUNT algorithm (see Figure 4), only active clauses can be simplified/discarded, or used
to simplify/discard new clauses. So, there are no dashed lines between the box passive and the forward 
and backward simplification boxes. Note also that the clause in current is subject to forward simplification 
(arrows labeled with 3), and it is used to simplify the active clauses (arrows labeled with 4). This is done to 
keep the set of active clauses as simple as possible^8.

In the DISCOUNT algorithm passive clauses are constructed practically exclusively for evaluation of 
their properties which have to be known for controlling the inference selection. One may argue that the set 
of passive clauses is just a representation of all (potentially non-redundant) one-step inferences from the 
active clauses, and from this point of view the DISCOUNT algorithm implements the idealistic notion of 
inference selection described in the beginning of Section 3.1. In other words, the DISCOUNT algorithm
allows the prover to observe the space of all possible one-step inferences between active clauses, which 
is a good thing by itself. However, the algorithm also obliges the system to do so by explicitly making all 
such inferences and storing the resulting clauses as passive. The cost of good inference selection becomes 
very high. Typically, a thousand active clauses may generate hundreds of thousands of inferences, and a great
deal of the resulting clauses may be non-redundant with respect to the active ones, and, as such, have to
be stored as passive. Since the passive clauses are not used for anything but selection, the work spent on
constructing a clause may be frozen for a long time, while the clause remains passive, and this work is lost if
the prover exhausts a given time or memory limit and terminates. Storing huge numbers of passive clauses
may additionally require a lot of memory.

Fig. 4. DISCOUNT algorithm

5 Implemented, in particular, in Gandalf, Otter, SNARK, Spass and Vampire.

6 Implemented, in particular, in E, Vampire and Waldmeister.

7 In the diagrams in Figures 3 and 4, a broad arrow from an operation to a set indicates that the operation modifies the set by removing
or replacing some clauses.

8 Simplicity here is, of course, relative to the features of the used inference system, in particular, the redundancy criteria.

The Otter algorithm is not completely immune to any of these problems either. In addition, the cost of
simplification operations grows with the growth of the set of passive clauses. 

Recently there have been (at least) two attempts to address some of these issues. Vampire implements the
Limited Resource Strategy [33], which is intended to minimise the amount of work on generating, processing
and keeping passive clauses in the Otter algorithm, which is wasted when the time limit is reached. This
is done by discarding some non-redundant but heuristically bad clauses and inferences. Waldmeister implements
a sophisticated scheme to reduce the memory requirements of the DISCOUNT algorithm [13, 8]. In
both cases, the adjustments of the top level algorithms led to a great improvement in the effectiveness of the 
systems. This gives me hope that a radically different approach to inference selection may result in a real 
performance breakthrough. 

3.2 Finer selection units with graded activeness 

Finer selection units. The inherent problems with the given-clause algorithm motivated me to look for a 
scheme that can facilitate better control of search at an affordable cost. Instead of selecting clauses, we are 
going to select some particular parts (literals or subterms) of clauses and make them available for some 
particular kinds of inferences. Such triples (clause + clause part + inference rule) will be the new selection
units. This will help us to avoid premature invocation of less promising clause parts. 

Stronger heuristics become available for evaluating the quality of selection units since such evaluation 
can take into account more than just integral characteristics of a whole clause. For example, a selection unit 
with a generally good clause, but with a bad literal or subterm intended for a prolific inference rule may now 
be given a low priority. On the one hand, this allows us to delay inferences with a bad part of the clause. On 
the other hand, we do not have to delay all inferences with the clause simply because one of its parts is bad.

As an illustration, consider the unit clause p(f(x, y), f(a, b)). If some form of paramodulation is allowed,
the subterm f(x, y) is available for paramodulation into. This selection unit is extremely prolific
since f(x, y) unifies with all terms starting with f, and it makes good sense to delay paramodulations into
this term without postponing other inferences with the clause, e.g., paramodulations into f(a, b).

Another example of a highly promising heuristic which is enabled by the proposed approach, is to give 
higher priority to binary resolution than to paramodulation, since the latter is often much more prolific than 
the former. This heuristic already works very well (at least) in Vampire. The prover never enables inferences 
with positive equalities in a clause if there are literals of other kinds. Although generally successful, this 
strategy often fails if all the other literals are relatively bad, e.g., if they can generate many inferences. 
Consider the clause p(x, y) V q(a, b) V f(a, b) ~ a. The literal p(x, y) is likely to be more prolific than 
f(a, b) ~ a, since p(x, y) unifies with any atom starting with p. The proposed scheme allows us to give very 
high priority to q(a, b), lower priority to f(a, b) ~ a since this is a positive equality, and a very low priority 
to the overly prolific literal p(x, y). 
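
The triples (clause + clause part + inference rule) and the priorities from the last example can be sketched as a small data structure. Everything specific below is a hypothetical choice for illustration: the class name `SelectionUnit`, the rule weights, and the toy `quality` function, which simply penalises parts whose top-level arguments are variables. It is not a heuristic taken from Vampire or any other prover.

```python
from dataclasses import dataclass

VARIABLES = {"x", "y", "z"}   # the paper's convention: x, y and z are variables

# Hypothetical weights: resolution is preferred to the more prolific paramodulation.
RULE_WEIGHT = {"resolution": 1.0, "paramodulation": 0.5}


@dataclass(frozen=True)
class SelectionUnit:
    clause: str   # the whole clause, kept only for reference
    part: tuple   # the selected literal or subterm as (symbol, arg1, ..., argk)
    rule: str     # the inference rule the part is made available for


def quality(unit):
    """Toy heuristic: the more variable arguments, the more prolific the part."""
    variable_args = sum(1 for arg in unit.part[1:] if arg in VARIABLES)
    return RULE_WEIGHT[unit.rule] / (1 + 10 * variable_args)


clause = "p(x, y) | q(a, b) | f(a, b) ~ a"
units = [
    SelectionUnit(clause, ("p", "x", "y"), "resolution"),
    SelectionUnit(clause, ("q", "a", "b"), "resolution"),
    SelectionUnit(clause, ("~", "f(a, b)", "a"), "paramodulation"),
]
ranked = sorted(units, key=quality, reverse=True)
```

With these made-up weights the ranking is q(a, b), then the positive equality, then p(x, y), which is the order argued for above. The point of the sketch is only that the unit of ranking is a clause part paired with a rule, not a whole clause.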

Also, simplification inferences and redundancy tests can be treated in the same way as deduction in- 
ferences. In the example above, we could make the term f(a, b) available for rewriting immediately, and 
postpone the integration of f(x, y) into the corresponding indexes until much later. By delaying simplifica- 
tion inferences on stored clauses in a controlled manner we can achieve behaviours combining the properties 
of the Otter and DISCOUNT algorithms. If simplification inferences are given higher priority, the behaviour 
of our procedure will be closer to that of the Otter algorithm. If simplification inferences have priority com- 
parable to the priority of deduction inferences such as resolution and paramodulation, we can expect the 
new procedure to behave similarly to the DISCOUNT algorithm.

Graded activeness. Apart from changing the subject of selection, I propose to change the notion of selection 
itself. The given-clause algorithm divides the search state in two parts. One part contains active clauses, and 
the other one contains passive clauses that are not yet available for deduction inferences. If a clause gets 
into the active set, it becomes available for all future inferences regardless of its quality. To overcome this 
problem, I propose to use a finer gradation of selection unit activeness.

Intuitively, all selection units would become potentially available for inferences almost immediately, but 
some would be "more available" than the others. Less active selection units would be available for inferences 
with more active ones. High degree of activeness of a selection unit would indicate higher priority of this 
unit for proof search. In the new procedure, the units containing parts of newly generated clauses initially 
receive the minimal degree of activeness, but later are gradually promoted to higher degrees of activeness. 
When a promotion step takes place, the selection unit becomes available for new inferences with some units 
which have not been eligible so far due to insufficient activeness. To give higher priority to inferences with 
heuristically better selection units, the promotion frequency for different units should vary according to their 
quality. Thus, we will be able to delay inferences between heuristically bad selection units.

To illustrate this rather general scheme, I will outline a simple implementation scheme. For this imple- 
mentation the nature of the used selection units is irrelevant, i. e. they can be clauses as well as the finer 
selection units proposed above. However, the implementation relies on the assumption that the quality of 
selected units is reflected by a special real-valued coefficient which takes positive values. 

If $v$ is a selection unit, the corresponding coefficient will be denoted as $quality(v)$. The intuitive meaning
of the quality coefficient is the relative frequency of promotion. If $v_1$ and $v_2$ are two selection units, at
each promotion step the probability of selecting $v_1$ for promotion relates to the probability of selecting $v_2$
as $quality(v_1)$ relates to $quality(v_2)$. Practically, we can select units for promotion randomly according to
the distribution explicitly specified with the quality coefficients. This selection discipline is known in the
area of Genetic Algorithms as roulette-wheel selection [11].
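
Roulette-wheel selection is directly available in the Python standard library through `random.choices` with a `weights` argument. The sketch below is a minimal illustration; the function name `select_for_promotion` and the two example units with qualities 3 and 1 are assumptions made for this example.

```python
import random


def select_for_promotion(units, quality, rng=random):
    """Pick one unit with probability proportional to quality(unit)."""
    weights = [quality(u) for u in units]
    return rng.choices(units, weights=weights, k=1)[0]


# Two hypothetical units whose qualities relate as 3 to 1.
rng = random.Random(0)
quality = {"v1": 3.0, "v2": 1.0}.get
draws = [select_for_promotion(["v1", "v2"], quality, rng) for _ in range(10_000)]
ratio = draws.count("v1") / draws.count("v2")
```

Over many draws `ratio` should come out close to 3, matching the requirement that selection probabilities relate as the quality coefficients do. Recomputing the weight list on every call costs time linear in the number of units, so a real implementation would presumably maintain the distribution incrementally.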

To realise the idea of graded activeness, I propose to partition the set of all available selection units
into $n + 1$ sets $T_0, T_1, \dots, T_n$. The indexes of the sets reflect the activeness of selection units contained in
them: units in $T_{i+1}$ are more active than units in $T_i$. More specifically, for $i > 0$, $v \in T_i$ implies that all
possible inferences between $v$ and units from $T_{n-i+1}, \dots, T_n$ have been made and no inferences between $v$
and units in $T_0, \dots, T_{n-i}$ have been considered yet. $T_0$ contains absolutely passive selection units, i.e., units
that have not participated in any inferences yet. This invariant, illustrated in Figure 5, is maintained by the
following procedure. As soon as some selection unit is constructed, it is placed in $T_0$. At each macrostep of
the procedure some selection unit $v$ is selected for promotion as outlined above. If $v$ happens to be in $T_i$,
where $i < n$, its promotion means that $v$ is removed from $T_i$, all possible inferences between $v$ and selection
units from $T_{n-i}$ are made, and $v$ is placed in $T_{i+1}$. Selection units from $T_n$ are not promoted; they have the
maximal activeness.
Fig. 5. Graded activeness implementation
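
The promotion procedure can be sketched in a few lines. This is an illustration under assumptions rather than a proposed implementation: the class name `GradedActiveness` and the callbacks `quality` and `infer` are hypothetical, promotion candidates are drawn only from the tiers below the top one (units already in the top tier cannot be promoted anyway), and inferences of a unit with itself are ignored.

```python
import random


class GradedActiveness:
    def __init__(self, n, quality, infer, rng=None):
        self.n = n
        self.tiers = [[] for _ in range(n + 1)]   # tiers[i] plays the role of T_i
        self.quality = quality                    # unit -> positive float
        self.infer = infer                        # (unit, unit) -> iterable of new units
        self.rng = rng or random.Random()

    def add(self, unit):
        """A newly constructed unit starts as absolutely passive."""
        self.tiers[0].append(unit)

    def macrostep(self):
        """Promote one unit chosen by roulette-wheel selection; False if none is left."""
        candidates = [(i, u) for i in range(self.n) for u in self.tiers[i]]
        if not candidates:
            return False
        weights = [self.quality(u) for _, u in candidates]
        i, unit = self.rng.choices(candidates, weights=weights, k=1)[0]
        self.tiers[i].remove(unit)
        new_units = []
        # Promotion from tier i makes all inferences with tier n - i.
        for partner in self.tiers[self.n - i]:
            new_units.extend(self.infer(unit, partner))
        self.tiers[i + 1].append(unit)
        for new_unit in new_units:
            self.add(new_unit)
        return True


# Usage: record which pairs interact when no new units are ever produced.
pairs = []


def record(u, v):
    pairs.append(frozenset((u, v)))
    return []


state = GradedActiveness(n=2, quality=lambda u: 1.0, infer=record, rng=random.Random(1))
for unit in "abc":
    state.add(unit)
while state.macrostep():
    pass
```

In this run every pair of distinct units interacts exactly once, whatever order the random promotions take. The reason is that two units in tiers $i$ and $j$ have interacted exactly when $i + j > n$, and each promotion raises this sum by one.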



Special arrangements may have to be made if we admit selection units that need not interact with other 
selection units to produce inferences. For example, we may decide that selection units intended for binary 
factoring ([2]) with a particular literal in a particular clause do not need a counterpart unit, i. e. if we decide 
to deploy such a unit, we will have to make all possible factoring inferences with the specified literal within 
the specified clause. One possibility of dealing with such selection units is to designate some activeness 
$i > 0$ as a threshold, so that when a selection unit reaches $T_i$, all inferences requiring this unit alone are
immediately made. 

As a whole, the proposed inference selection scheme allows for much better control over inference 
selection which may translate into higher productivity of heuristics and enables the use of new heuristics 
which could not be used with the given-clause algorithm. Apart from other things, the extra flexibility^9
of inference selection will enhance the diversity of available strategies^10. These advantages come at an
affordable cost. The only involved overhead, caused by the need to store large numbers of selection units, 
is compensated by lower numbers of heuristically bad clauses which have to be created and stored only to 
maintain completeness. 

I would like to add one final consideration here. The calculi used in the state-of-the-art saturation-based
provers are designed with the aim of reducing search space. Partially, they do this by restricting
the applicability of resolution and paramodulation rules. Often this is done by prohibiting inferences with
certain parts of clauses. For example, ordered resolution with literal selection (see [2]) prohibits resolving
non-maximal positive literals. However, restricting the shape of eligible derivations also means restricting
the number of eligible solutions, and simple solutions are often thrown away if they do not satisfy the
restrictions.

9 The proposed design is strictly more flexible than the standard ones since it is possible to implement it in such a way that both
the Otter and DISCOUNT algorithms can be simulated by appropriate parameter settings.

10 My experience suggests that this is a very important factor, as in 2002-2005 the multitude of strategies supported by the Vampire
kernel has been a major, if not the main, contributor to the growth of performance of the whole system.

It is possible, in principle, to relax the restrictions by allowing some redundant inferences with some 
heuristically good parts of clauses. For example, we may want to resolve large (i.e., containing many symbols)
positive non-maximal literals with the aim of obtaining smaller resolvents. However, adjusting prover
architectures based on the standard variants of the given-clause algorithm in this way would call for the
introduction of new ad hoc mechanisms for regulating the proportion of redundant and non-redundant inferences.
The scheme proposed in this paper seems to have sufficient flexibility to accommodate such control mecha- 
nisms for free. For example, we can allow selection units with large positive non-maximal literals, and assign 
to them higher quality measure than to non-redundant selection units, if we are eager to derive small clauses 
earlier. If we choose to be more conservative and want to avoid most redundant inferences except a small 
number of very heuristically promising ones, we can always assign higher quality measure to non-redundant 
selection units. 
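
The two policies just described can be illustrated with a minimal sketch. It assumes a hypothetical `SelectionUnit` record and a hypothetical additive `quality` function; neither the names nor the numeric bonus come from an existing prover, and a real quality measure would depend on many more parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionUnit:
    clause_id: int
    literal_size: int   # number of symbols in the selected literal
    redundant: bool     # True if the restricted calculus would prohibit this inference


def quality(unit: SelectionUnit, eager: bool, bonus: int = 100) -> int:
    """Hypothetical quality measure: larger selected literals score higher,
    and redundant units are shifted up (eager) or down (conservative)."""
    shift = bonus if eager else -bonus
    return unit.literal_size + (shift if unit.redundant else 0)


def selection_order(units, eager: bool):
    """Return the units sorted from best to worst quality."""
    return sorted(units, key=lambda u: quality(u, eager), reverse=True)


units = [
    SelectionUnit(clause_id=1, literal_size=3, redundant=False),  # maximal literal
    SelectionUnit(clause_id=1, literal_size=9, redundant=True),   # large positive non-maximal literal
]
eager_first = selection_order(units, eager=True)[0]
conservative_first = selection_order(units, eager=False)[0]
```

With `eager=True` the redundant unit is selected first; with `eager=False` it is postponed, although a redundant unit with a sufficiently large literal could still outrank a non-redundant one, which roughly corresponds to keeping a few very promising redundant inferences.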

3.3 Methodological considerations 

The proposed scheme for finer inference selection is completely compatible with the modern theory of 
resolution and paramodulation, and requires no theoretical analysis. The difficult part of the job is to find an 
adequate design and to do the actual implementation. 

One of the implementation options is to adjust an existing system. To investigate this possibility, I looked 
through the code of the kernel of Vampire, v7.0, with the purpose of estimating the amount of work required 
to adjust it to the new scheme. This investigation has convinced me that at least one third of the code 
would have to be rewritten completely, and at least another third of it would have to be heavily adjusted to
accommodate the new code. This is hardly surprising, taking into account that the proposed changes target the
top level design as well as some key data representations and some mid-level functionality such as indexing. 

The main conclusion of my inspection of the Vampire kernel code is that the amount of work required 
for a transition to the new scheme is likely to exceed the cost of creating a rather advanced brand new 
prototype. An implementation from scratch can also be better tailored to the new design. Considering this 
additional advantage, my preference is clear. However, I do not dismiss the possibility of implementing the 
new scheme on the basis of other advanced saturation-based provers.

The nature of the proposed architectural principles is such that their advantages can only be fully
demonstrated if a significant effort is invested in design and assessment of search heuristics. Indeed, the main
advantages of the new inference selection approach are the higher productivity of existing heuristics and the
possibility of using new heuristics.

This extra flexibility in directing proof search can only be fully exploited by means of tuning. Therefore,
very extensive experimentation will be necessary to find generally good combinations of parameters of
heuristics, as well as strategies specialised for important classes of problems (footnote 11). A strong tuning
infrastructure to support such experimentation seems highly desirable. Developing such an infrastructure may
itself be an interesting research problem.

11 For experiments one can use the TPTP library [37], which is at the moment the largest and most diverse collection of first-order
proof problems. It would also be very useful to look at more specialised large problem sets coming from applications in order to
demonstrate the tunability of the proposed architecture.

Finally, I would like to add a note about term indexing. The finer gradation of activeness divides clauses
and their parts into many logically separate sets. An initial implementation may adapt existing techniques
to index these sets separately. However, better specialised indexing solutions may exist, and, if the proposed
design proves viable in the initial experiments, it may give rise to a new line of research in term indexing.

4 Generalisation-based prioritising of search directions 

4.1 Background: local syntactic relevancy estimation 

Blind search in indefinitely large spaces is usually not effective enough for most applications, so all
modern saturation-based provers try to predict the relevancy of particular search directions by using various
heuristics. In a saturation process state, the available search directions are identified by the accumulated
clauses (e.g., the contents of the sets passive and active in the pick-given clause algorithm presented in
Figure 1). The most common heuristics prioritise search directions by giving some clauses higher priority
for participating in inferences than others. The estimation of relevancy of a clause is based on such
characteristics of the clause as its structural complexity (e.g., simpler clauses get higher priority) or its potential
for participating in inferences (e.g., very prolific clauses get very low priority).

Such approaches have natural limitations. The syntactic characteristics of a clause, used in the
estimation, often fail to reflect the usefulness of the clause adequately. For example, a structurally complex clause
may be absolutely indispensable for any solution of the problem at hand, but it will be suspended for a long
time. Another problem is that the estimation is done locally, i.e. only one clause is analysed and global
properties of the current search state are not taken into account. For example, an absolutely irrelevant clause, i.e.,
one participating in no minimal unsatisfiable subset of the current clause set, may be given high priority because
of its simplicity.
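
The limitation can be seen in a minimal sketch of local syntactic estimation. It assumes a hypothetical term encoding (variables are strings, other terms are tuples whose first component is the symbol) and a plain symbol-count weight; real provers use richer weight functions, so this is only an illustration.

```python
import heapq


def term_size(term):
    """Number of symbols in a term; variables (strings) count as one symbol."""
    if isinstance(term, str):
        return 1
    return 1 + sum(term_size(arg) for arg in term[1:])


def clause_weight(clause):
    """Symbol-count weight of a clause given as a list of (sign, atom) pairs."""
    return sum(term_size(atom) for _sign, atom in clause)


def select_given(passive):
    """Pick the lightest clause; the index breaks ties and keeps clauses uncompared."""
    heap = [(clause_weight(c), i, c) for i, c in enumerate(passive)]
    heapq.heapify(heap)
    return heapq.heappop(heap)[2]


A, B, C = ("a",), ("b",), ("c",)
# Assumed for illustration: this complex clause is needed by every refutation.
needed = [(True, ("p", ("f", ("g", A, B), ("h", C)))), (False, ("q", ("f", A, B)))]
# Assumed for illustration: this simple clause is irrelevant.
irrelevant = [(True, ("r", A))]
chosen = select_given([needed, irrelevant])
```

The weight looks at one clause at a time, so the simple clause is chosen first no matter how the two clauses relate to the rest of the search state.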

4.2 Generalisation-based prioritising of proof-search directions 

To address the issues raised above, I propose a method for intelligent prioritising of search directions. The 
idea is as follows. We will estimate the potential of a clause to participate in solutions of the whole problem 
at hand by interacting with other currently available clauses. Precise estimation is impossible since it would 
require finding all, or at least some, solutions of the problem, so we are looking for a good approximation. 

General method. I suggest probing the search space by exploring a substantially simpler search space.
The latter is obtained from the former by generalising some search directions. This is done by replacing
(preferably large) clusters of similar clauses with their common generalisations. If we find a solution of the
simplified problem which involves the generalisation of a particular cluster, this is a good indication that at
least some of the clauses in the cluster can be relevant. More importantly, the clauses whose generalisations
have not yet proved useful can be suspended as potentially irrelevant. Additionally, the closer a resolved
generalisation is to a particular clause in its cluster, the better the chances the clause has of participating in a
solution and the higher the priority it should be given.

Generalisations can be defined semantically: a clause $C$ can be called a generalisation of a clause $D$ if
$C$ logically implies $D$. For our purposes, however, it is convenient to use a simpler, syntactically defined
notion of generalisation, based on subsumption. In what follows, we will call $C$ a generalisation of $D$ if $C$
subsumes $D$, i.e. $C\theta \subseteq D$ for some substitution $\theta$ (footnote 12), where $C\theta$ and $D$ are viewed as sets of literals.
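
The syntactic notion can be made concrete with a small backtracking test. This is a sketch under stated assumptions: variables are strings, other terms are tuples headed by their symbol, clauses are lists of (sign, atom) pairs, and the two clauses share no variable names. The function names are hypothetical, and no attempt is made at the indexing a real prover would use.

```python
def match(pattern, target, subst):
    """One-way matching: extend subst so that pattern instantiated by subst equals target."""
    if isinstance(pattern, str):                      # pattern variable
        if pattern in subst:
            return subst if subst[pattern] == target else None
        return {**subst, pattern: target}
    if isinstance(target, str) or pattern[0] != target[0] or len(pattern) != len(target):
        return None
    for p_arg, t_arg in zip(pattern[1:], target[1:]):
        subst = match(p_arg, t_arg, subst)
        if subst is None:
            return None
    return subst


def subsumes(c, d, subst=None):
    """Set-based subsumption: is there a substitution mapping every literal of c into d?"""
    subst = {} if subst is None else subst
    if not c:
        return True
    (sign, atom), rest = c[0], c[1:]
    for d_sign, d_atom in d:
        if d_sign == sign:
            extended = match(atom, d_atom, subst)
            if extended is not None and subsumes(rest, d, extended):
                return True
    return False


A, B = ("a",), ("b",)
d = [(True, ("p", ("f", A, B))), (True, ("p", ("g", B, A))), (True, ("q", A))]
general = [(True, ("p", ("f", "x1", "x2"))), (True, ("p", ("g", "x2", "x1")))]
not_general = [(True, ("p", ("f", "x1", "x2"))), (True, ("p", ("g", "x1", "x2")))]
```

Here `subsumes(general, d)` holds with the substitution mapping `x1` to `a` and `x2` to `b`, while `not_general` fails because no single substitution fits both of its literals.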

Implementation with naming and folding. Technically, the general approach described above can be
realised by means of a combination of dynamic naming and folding. This combination is called the decomposition
rule in [17], but for the purposes of this paper it is convenient to consider the rules separately.

12 A more restrictive multiset-based variant of subsumption, where $C\theta$ is required to be a submultiset of $D$, can also be used.
The idea should be clear from the following example. Suppose we have a clause
$C_1 = p(f(a,b)) \lor p(g(b,a)) \lor q(a)$. We decide that this clause is too specific and its generalisation
$\Delta(x_1, x_2) = p(f(x_1, x_2)) \lor p(g(x_2, x_1))$ should be explored first. To this end, we introduce a new binary
(according to the number of variables in $\Delta$) predicate $\gamma_1$ and make it the name for $\Delta$. Logically, this can
be viewed as introduction of the definition $\forall x_1, x_2.\ \gamma_1(x_1, x_2) \Leftrightarrow \Delta(x_1, x_2)$. We immediately transform
$C_1$ by folding this definition into the following clause: $C'_1 = \gamma_1(a, b) \lor q(a)$. Moreover, if there are other
clauses, currently stored or derived in the future, which are instances of the generalisation $\Delta$, we can apply
folding to them as well, thus recognising that the clauses are covered by the generalisation $\Delta$. For example,
if the clause $C_2 = p(f(h(a),b)) \lor p(g(b,h(a))) \lor r(b)$ is derived, it will be replaced by the clause
$C'_2 = \gamma_1(h(a), b) \lor r(b)$. The generalisation $\Delta$ is injected into the search space in the form of the clause
$\Delta(x_1, x_2) \lor \neg\gamma_1(x_1, x_2)$, which is a logical consequence of the definition for $\gamma_1$.
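
Folding in the example above can be sketched as follows. The sketch assumes the same hypothetical encoding as before (variables are strings, other terms are tuples headed by their symbol, clauses are lists of (sign, atom) pairs); `fold` and `cover` are illustrative names and do not come from any existing prover.

```python
def match(pattern, target, subst):
    """One-way matching: extend subst so that pattern instantiated by subst equals target."""
    if isinstance(pattern, str):
        if pattern in subst:
            return subst if subst[pattern] == target else None
        return {**subst, pattern: target}
    if isinstance(target, str) or pattern[0] != target[0] or len(pattern) != len(target):
        return None
    for p_arg, t_arg in zip(pattern[1:], target[1:]):
        subst = match(p_arg, t_arg, subst)
        if subst is None:
            return None
    return subst


def cover(gen, clause, subst, used):
    """Find a substitution mapping all literals of gen into clause; record the covered positions."""
    if not gen:
        return subst, used
    (sign, atom), rest = gen[0], gen[1:]
    for i, (c_sign, c_atom) in enumerate(clause):
        if c_sign == sign:
            extended = match(atom, c_atom, subst)
            if extended is not None:
                result = cover(rest, clause, extended, used | {i})
                if result is not None:
                    return result
    return None


def fold(name, variables, gen, clause):
    """Replace the literals covered by the generalisation with one positive name literal."""
    result = cover(gen, clause, {}, frozenset())
    if result is None:
        return None                       # the clause is not covered by the generalisation
    subst, used = result
    name_literal = (True, (name,) + tuple(subst[v] for v in variables))
    return [name_literal] + [lit for i, lit in enumerate(clause) if i not in used]


A, B = ("a",), ("b",)
HA = ("h", A)
gen = [(True, ("p", ("f", "x1", "x2"))), (True, ("p", ("g", "x2", "x1")))]
c1 = [(True, ("p", ("f", A, B))), (True, ("p", ("g", B, A))), (True, ("q", A))]
c2 = [(True, ("p", ("f", HA, B))), (True, ("p", ("g", B, HA))), (True, ("r", B))]
c1_folded = fold("gamma1", ["x1", "x2"], gen, c1)
c2_folded = fold("gamma1", ["x1", "x2"], gen, c2)
```

Both example clauses are folded by the same generalisation, which is what later allows work spent on the generalisation to be shared between them.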

In order to obtain the behaviour prescribed by the general scheme, clauses containing $\gamma$-predicates
(i.e., predicates which are generalisation names) are given special treatment. Namely, if a clause contains
negatively the name $\gamma_1$ for the generalisation $\Delta$, it means that the clause was derived from the clause
$\Delta(x_1, x_2) \lor \neg\gamma_1(x_1, x_2)$ representing the generalisation $\Delta$ in the search space. In such clauses, we prohibit
all inferences involving negative $\gamma$-literals (i.e., literals with $\gamma$-predicates) if there is at least one literal of
a different kind. Roughly, in the clause $\Delta(x_1, x_2) \lor \neg\gamma_1(x_1, x_2)$ we want to resolve the generalisation
part $\Delta(x_1, x_2)$ before we touch the literal $\neg\gamma_1(x_1, x_2)$. Until this happens, the literal $\neg\gamma_1(x_1, x_2)$ only
accumulates the substitution which solves $\Delta(x_1, x_2)$.

When a clause containing only negative literals with generalisation names is derived, this indicates that
some generalisations "fired", i.e. they contradict each other and some ordinary input clauses. The
derivation of such a clause can be viewed as a representation of a refutation of the clause set consisting of the
involved generalisations and input clauses. We will call such clauses $\gamma$-contradictions and their inferences
$\gamma$-refutations.

Clauses containing positive $\gamma$-literals are suspended (temporarily removed from the search state) until
all of the corresponding generalisations have proved useful, i.e. every participating $\gamma$-predicate belongs to
at least one $\gamma$-contradiction. When we can no longer suspend such a clause, we still block any inferences
involving its non-$\gamma$ literals. A resolution inference between such a clause and a $\gamma$-contradiction indicates
that the clause is compatible with the corresponding $\gamma$-refutation, and it represents an attempt to (gradually)
refine the $\gamma$-refutation into a solution for the original problem. If some form of paramodulation is used,
we have to allow paramodulation into the positive $\gamma$-literals in an attempt to make them compatible with
available $\gamma$-contradictions.

To illustrate this, I continue the example. Suppose we have derived the $\gamma$-contradiction $\neg\gamma_1(a, b)$. The
clauses $C'_1 = \gamma_1(a, b) \lor q(a)$ and $C'_2 = \gamma_1(h(a), b) \lor r(b)$ can no longer be suspended. The clause $C'_1$ is
directly compatible with the $\gamma$-refutation, which results in a derivation of the clause $q(a)$. The clause $C'_2$ is
not compatible with the $\gamma$-refutation since $\gamma_1(h(a), b)$ is not unifiable with $\gamma_1(a, b)$. However, in the presence
of the unit equality clause $h(a) \simeq a$, we can rewrite $C'_2$ into $\gamma_1(a, b) \lor r(b)$, which is compatible with
the $\gamma$-refutation, and then derive $r(b)$. Note that the work spent on refuting the generalisation $\Delta$ (modulo
some ordinary input clauses) is utilised: we do not repeat the same inferences with the generalised literals
from the original clause $C_1$. Moreover, the results of this work are shared with another clause, $C_2$, and,
potentially, with many other clauses covered by the generalisation $\Delta$. Such sharing of work on similar parts
of potentially very many different clauses can be an additional advantage.
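
The continued example can be replayed with a deliberately simplified sketch. It assumes ground clauses only, so unification reduces to syntactic identity and rewriting to replacing one ground subterm by another; terms are tuples headed by their symbol, and all function names are hypothetical.

```python
GAMMA_NAMES = {"gamma1"}  # hypothetical registry of generalisation names


def is_gamma(atom):
    return atom[0] in GAMMA_NAMES


def is_gamma_contradiction(clause):
    """True if the clause is non-empty and has only negative literals with generalisation names."""
    return bool(clause) and all(not sign and is_gamma(atom) for sign, atom in clause)


def releasable(clause, contradictions):
    """A clause may leave suspension once every name occurring positively in it
    belongs to at least one of the derived contradictions."""
    fired = {atom[0] for c in contradictions for _sign, atom in c}
    return all(atom[0] in fired for sign, atom in clause if sign and is_gamma(atom))


def resolve_ground(unit_contradiction, clause):
    """Resolve a unit contradiction against an identical positive literal (ground case only)."""
    (_sign, atom), = unit_contradiction
    if (True, atom) not in clause:
        return None
    return [lit for lit in clause if lit != (True, atom)]


def rewrite(term, lhs, rhs):
    """Replace every occurrence of the ground term lhs by rhs."""
    if term == lhs:
        return rhs
    return (term[0],) + tuple(rewrite(arg, lhs, rhs) for arg in term[1:])


A, B = ("a",), ("b",)
contradiction = [(False, ("gamma1", A, B))]
c1_folded = [(True, ("gamma1", A, B)), (True, ("q", A))]
c2_folded = [(True, ("gamma1", ("h", A), B)), (True, ("r", B))]
# Rewriting with the unit equality h(a) = a, oriented left to right.
c2_rewritten = [(sign, rewrite(atom, ("h", A), A)) for sign, atom in c2_folded]
```

Resolving the contradiction with the first folded clause yields $q(a)$ directly, whereas the second folded clause only becomes compatible after rewriting and then yields $r(b)$.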

Note that the proposed naming- and folding-based scheme is rather flexible. It allows many variants
which may differ, e.g., in the way suspended clauses are treated, how selection of inferences is done with
the $\gamma$-literals, how generalisations are chosen, how many generalisations can be applied to a single clause
and whether they can be overlapping (footnote 13), etc. The description above is only intended to provide a general
framework for formulating such variants. Moreover, it is obviously not the only possible framework for
implementing the general scheme presented in the beginning of this section.

The proposed implementation scheme offers another advantage for free. The user gets an additional
means of controlling proof search by specifying in the input which clauses should be made named
generalisations from the start. This can be viewed as a way of hinting at useful lemmas (of a restricted kind
since only clauses are named rather than arbitrary formulas) or suppressing search directions which do not
seem promising to the user.

For example, by analysing some previous proof attempts the user may conclude that many clauses of
the form $\neg p(g(a, b)) \lor C$ are generated. If the user has reasons to believe that the literal $\neg p(g(a, b))$ can be
solved, i.e. $p(g(a, b))$ is logically implied by the input clauses, they may want to try proving $p(g(a, b))$ as a
lemma, and later use it to resolve with the clauses $\neg p(g(a, b)) \lor C$. Practically, this can be done by making
$\neg p(g(a, b))$ a generalisation and giving it some name, e.g. $\gamma_3$. Refuting $\neg p(g(a, b)) \lor \neg\gamma_3$ corresponds
to proving the lemma $p(g(a, b))$, and resolutions between the $\gamma$-contradiction $\neg\gamma_3$ and clauses of the form
$\gamma_3 \lor C$ correspond to applications of the lemma. Such lemma hinting may be beneficial because it allows
sharing the work on solving the literals $\neg p(g(a, b))$ in many different clauses instead of solving them separately.

If the user has reasons to believe that the literal $\neg p(g(a, b))$ cannot be solved, and thus all the clauses
$\neg p(g(a, b)) \lor C$ are redundant, it still makes sense to make $\neg p(g(a, b))$ a generalisation. This will keep
the generalised clauses $\neg p(g(a, b)) \lor C$ away from inferences without completely discarding them. Only
if the user's intuition was incorrect, i.e. $\neg p(g(a, b))$ can actually be solved, are the generalised clauses
reintroduced into the search space.

4.3 Related work 

Static relevancy prediction. My original idea was to use some sort of clause abstractions for dynamic
suppression of potentially irrelevant search directions in the framework of saturation-based reasoning. This
idea was inspired by [7], where the authors propose to use various clause abstractions for statically
identifying input clauses which are practically irrelevant, i.e. cannot be useful in a proof attempt of acceptable
complexity. Roughly, this is done by applying abstractions to an input clause set, exploring the space of all
proofs of restricted complexity with the abstracted clause set, and throwing away the input clauses whose
abstractions do not participate in any of the obtained proofs with the abstracted set.

Iterative generalisation-refinement. Some time ago [29] drew my attention to the simplest kind of clause
abstractions, namely generalisations, which seems convenient for our purposes. The method works roughly as
follows. A resolution prover is parameterised by a generalisation function on clauses, i.e. a function which
computes several, possibly overlapping, generalisations for a given clause. When the prover is run on a problem,
the generalisation mechanism replaces suitable clauses by their generalisations. The whole scheme works
as iteration through levels of generalisation strength. First, the prover is run with a strong generalisation
function to enumerate all refutations with depth below a certain limit. Then the generalisation function is
weakened (footnote 14) and the prover uses the previously found refutations to guide the enumeration of refutations with
the new generalisation function. The key idea is that the refutations with the weaker generalisation function
are in a certain (strict) sense refinements of the refutations obtained with the stronger generalisation
function. Such refinement is performed repeatedly and at some point the prover tries to refine a refutation from
the previous step into a refutation which uses no generalisation.

13 Intuitively, two generalisations of a clause $C$ overlap if they cover some common literals in $C$. For example,
$C_1 = p(f(a,b)) \lor p(g(b,a)) \lor q(a)$ has overlapping generalisations $p(f(x_1, x_2)) \lor p(g(x_2, x_1))$ and $p(g(x_1, x_2)) \lor q(x_3)$
because the literals $p(g(x_2, x_1))$ and $p(g(x_1, x_2))$ both generalise the literal $p(g(b, a))$.

14 Roughly, a weaker generalisation function produces more specific generalisations of a given clause.

Octopus approach. The Octopus system [25] runs a large number of sessions of the prover Theo [24]
distributed over a cluster of computers. Each Theo session first runs on a weakening of the original problem,
obtained by replacing one of the clauses with one of its generalisations. If one of the sessions succeeds in
solving the weakened problem, the solution is used to direct the search for a solution of the original problem
in two ways:

- The unmodified clauses from the original problem formulation, which participate in the solution of the 
weakened problem, are considered to be heuristically more relevant. In the future searches for solutions 
of the original problem, these clauses are given higher priority. 

- Some clauses in the obtained refutation of the generalised clause set, which were derived from
unmodified clauses, are added as lemmas to the problem formulation.

The main difference between my approach and the static relevancy prediction approach of [7], and also 
the Octopus approach [25], is that our clause generalisations are introduced dynamically, and can be used 
on derived clauses. This allows a good degree of adaptivity. 

My approach is closer to, and can be viewed as an attempt to revive the line of work presented in [29]. 
I hope to improve on this approach mainly by enumerating generalised refutations lazily, thus avoiding any 
artificial limits on the complexity of refutations and the need to enumerate a whole, potentially large, set 
of generalised refutations before we try to use these refutations. Also, my approach is more semantic in its 
nature since we do not try to refine generalised refutations by following their structure. We are interested in 
existence of $\gamma$-refutations rather than their shape. This allows much easier integration with various variants
of resolution- and superposition-based inference systems. Additionally, my approach imposes no
restrictions on how the generalisation functions are specified and implemented. In particular, the generalisation
mechanism can be adaptive. For example, the strength of generalisation may depend on various properties 
of clauses being generalised, or even on some global properties of the current search state. 

The general method is also partially inspired by, and shares some philosophical ideas with, [28] and [10].

The use of naming and folding is a natural continuation of my joint work with Andrei Voronkov on
implementing splitting without backtracking [31] and also partially stems from an unfinished attempt by
the author to mimic tableaux without backtracking [9] in the context of saturation. Recently I have
discovered that [17] proposes to use exactly the same combination of naming and folding, under the name of
decomposition rule, for deciding two description logics and query answering in one of them.

Semantic guidance in the style of SCOTT. To conclude the overview of relevant work, I would like to 
mention another approach which is technically unrelated to the one proposed here, but which also provides 
an alternative to local syntactic relevancy estimation. 

The semantic guidance approach, developed within the SCOTT project [15], is roughly as follows. The 
prover tries to establish satisfiability of several sets of stored clauses (in SCOTT this is done with the help 
of an external model builder). Ideally, these sets must approximate their maximal satisfiable supersets as 
closely as possible. The sets are used for guiding clause selection roughly as follows: clauses participating 
in fewer such satisfiable sets are given higher priority for selection. The intuition behind this approach is that 
a clause is more likely to be redundant if it participates in many satisfiable sets. This heuristic is supported 
by the fact that if a clause is in every maximal consistent subset, then it is definitely redundant. 
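
The selection rule described above can be sketched in a few lines. The sketch assumes that the satisfiable sets have already been found by some external model builder and are given as sets of clause identifiers; the function names are hypothetical and the sketch does not reproduce the actual SCOTT implementation.

```python
def satisfiable_memberships(clause_id, satisfiable_sets):
    """Count the known satisfiable sets that contain the clause."""
    return sum(1 for s in satisfiable_sets if clause_id in s)


def selection_order(clause_ids, satisfiable_sets):
    """Clauses occurring in fewer satisfiable sets come first; ties keep their input order."""
    return sorted(clause_ids, key=lambda c: satisfiable_memberships(c, satisfiable_sets))


# Illustrative data only: three satisfiable sets over clauses 1..4.
satisfiable_sets = [{1, 2, 3}, {2, 3}, {3, 4}]
order = selection_order([1, 2, 3, 4], satisfiable_sets)
```

The ordering itself is cheap; the cost of the approach lies in computing the satisfiable sets, which this sketch simply takes as input.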
The applicability of the semantic guidance approach seems limited because it relies on the costly
operation of establishing satisfiability of large clause sets. This overhead may be acceptable in solving very
hard problems when the user can afford to run a prover for hours or even days. Many applications, however,
require solving large numbers of simpler problems and a much quicker response. I hope that
generalisation-based guidance can be more useful for this kind of application because the associated overhead seems more
manageable due to the flexibility in the choice of generalisation function. In any case, a meaningful comparison of
the two approaches can only be done experimentally, when at least one variant of the generalisation-based
method is implemented.



4.4 Methodological considerations 

A certain theoretical effort is required to formulate the method in full detail. It makes sense to consider a
number of variants of the method and try to predict their strengths and weaknesses. It is also essential to have 
a clear picture of how the proposed use of generalisations will interact with the popular inference systems 
based on resolution, paramodulation and standard simplification techniques. In particular, it is necessary to 
consider the search completeness issues. 

The effectiveness of the method is likely to depend strongly on the choice of generalisation functions
and, therefore, a significant effort to find adequate heuristics would be well justified. In particular, anybody
implementing the method is very likely to encounter the problem of overgeneralisation. Working with too
strong generalisations of clauses may potentially lead to numerous $\gamma$-refutations that are not compatible
with any of the covered clauses. For example, if we fold the definition $\forall x.\ \gamma(x) \Leftrightarrow p(x)$ into the clause
$p(f(a))$, transforming it into $\gamma(f(a))$, we may later derive some $\gamma$-refutation $\neg\gamma(b)$ which is incompatible
with $\gamma(f(a))$. The work on deriving $\neg\gamma(b)$ is potentially wasted, unless, of course, there are other clauses
compatible with $\neg\gamma(b)$. Another problem with overgeneralisation is that $\gamma$-refutations compatible with many
clauses may be found quickly, and will activate the corresponding clauses. In such cases the work spent on
creating the generalisations themselves and applying them to clauses is wasted because the generalisations
do not fulfil their mission of suspending clauses. On the other hand, too weak generalisations may also
be bad, e.g., because they cover too small sets of clauses, in which case their construction is not properly
amortised. I hope these considerations illustrate the importance of searching for heuristics for
choosing effective generalisation functions.
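
The first overgeneralisation problem can be checked mechanically with a small unification test. This is a sketch under stated assumptions: variables are strings, other terms are tuples headed by their symbol, contradictions and folded clauses are represented only by their name atoms, and `wasted_contradictions` is a hypothetical helper rather than part of any prover.

```python
def walk(term, subst):
    """Follow variable bindings until an unbound variable or a non-variable term is reached."""
    while isinstance(term, str) and term in subst:
        term = subst[term]
    return term


def occurs(var, term, subst):
    term = walk(term, subst)
    if isinstance(term, str):
        return term == var
    return any(occurs(var, arg, subst) for arg in term[1:])


def unify(t1, t2, subst):
    """Syntactic unification with occurs check; returns an extended substitution or None."""
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 == t2:
        return subst
    if isinstance(t1, str):
        return None if occurs(t1, t2, subst) else {**subst, t1: t2}
    if isinstance(t2, str):
        return None if occurs(t2, t1, subst) else {**subst, t2: t1}
    if t1[0] != t2[0] or len(t1) != len(t2):
        return None
    for a1, a2 in zip(t1[1:], t2[1:]):
        subst = unify(a1, a2, subst)
        if subst is None:
            return None
    return subst


def wasted_contradictions(contradiction_atoms, folded_atoms):
    """Contradiction atoms that unify with no positive name literal of any folded clause."""
    return [c for c in contradiction_atoms
            if all(unify(c, f, {}) is None for f in folded_atoms)]


A, B = ("a",), ("b",)
folded = [("gamma", ("f", A))]                         # from folding p(f(a))
derived = [("gamma", B), ("gamma", ("f", "x"))]        # atoms of two derived contradictions
wasted = wasted_contradictions(derived, folded)
```

A count of such incompatible contradictions, gathered during experiments, could serve as one rough signal that a generalisation function is too strong.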

In contrast with the fine inference selection scheme, which essentially requires creating a new
implementation, the generalisation-based search guidance can be relatively easily integrated into some existing
provers, especially if it is implemented with naming and folding as outlined earlier. My experience with
implementing splitting-without-backtracking [31] (see also Chapter 5 in [30]) in the Vampire kernel suggests
that only a moderate effort is required to implement naming and folding on the basis of a reasonably
manageable implementation of forward subsumption, which is a standard feature in advanced saturation-based
provers.

The most difficult task is likely to be the design and implementation of a flexible, yet manageable,
mechanism for specifying generalisation functions, together with a higher-level interface for this
mechanism which would enable productive use of heuristics. The reliance on heuristics also implies that very
extensive experimentation will be required to assess the general effectiveness of the method and to compare
its variants.